Fix perf issues discovered in "For software performance, can you always trust inlining" blog #61408

EgorBo · 2021-11-10T10:58:28Z

In the blog post "For software performance, can you always trust inlining?", @lemire discovered two issues with the following C# code:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
unsafe static bool is_made_of_sixteen_digits(byte* chars)
{
    Vector128<sbyte> ascii0 = Vector128.Create((sbyte)47);
    Vector128<sbyte> after_ascii9 = Vector128.Create((sbyte)58);
    Vector128<sbyte> raw = Sse41.LoadDquVector128((sbyte*)chars);
    var a = Sse2.CompareGreaterThan(raw, ascii0);
    var b = Sse2.CompareLessThan(raw, after_ascii9);
    var c = Sse2.Subtract(a, b); // this is not optimal   
    return (Sse41.TestZ(c, c));
}

unsafe static int ParseNumberString(byte* p, byte* pend)
{
    if ((p + 16 <= pend) && is_made_of_sixteen_digits(p))
    {
        if ((p + 32 <= pend) && is_made_of_sixteen_digits(p + 16))
            return 2;
        return 1;
    }
    return 0;
}

is_made_of_sixteen_digits is not inlined into ParseNumberString without [AggressiveInlining].
The two Vector128.Create() calls are not CSE'd after inlining.

This PR fixes both issues. New codegen: https://www.diffchecker.com/oFYTlikO ([AggressiveInlining] is no longer needed).

1. Inlining

Actually, is_made_of_sixteen_digits is already inlined here in .NET 6.0 without any changes, but only when the code is promoted from Tier0 to Tier1 naturally. I enabled "call" opcode resolution for TieredCompilation=0 mode and/or for methods with loops, which currently bypass Tier0. Now is_made_of_sixteen_digits is always inlineable because the inliner understands what those calls are:

Inline candidate looks like a wrapper method.  Multiplier increased to 1.
Inline candidate has SIMD type args, locals or return value.  Multiplier increased to 4.
Inline has 5 intrinsics.  Multiplier increased to 6.5.
Inline has 2 foldable intrinsics.  Multiplier increased to 9.5.
Inline candidate callsite is boring.  Multiplier increased to 10.8.
calleeNativeSizeEstimate=816
callsiteNativeSizeEstimate=85
benefit multiplier=10.8
threshold=918
Native estimate for function size is within threshold for inlining 81.6 <= 91.8 (multiplier = 10.8)
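
The final lines of this log can be read as a simple size-versus-threshold check. The following is a minimal C++ sketch of that arithmetic, not the JIT's actual code: the struct and function names are hypothetical, and the rule "threshold = callsiteNativeSizeEstimate * multiplier" is an assumption inferred from the logged numbers (85 * 10.8 = 918). The estimates appear to be expressed in tenths, which would explain why the log prints 81.6 and 91.8.

```cpp
#include <cassert>

// Hypothetical container for the values printed in the inliner log.
struct InlineEstimate
{
    int    calleeNativeSizeEstimate;   // 816 in the log
    int    callsiteNativeSizeEstimate; // 85 in the log
    double multiplier;                 // 10.8 in the log
};

// Assumed rule: threshold = call-site estimate * benefit multiplier.
int inlineThreshold(const InlineEstimate& e)
{
    assert(e.multiplier > 0.0);
    return static_cast<int>(e.callsiteNativeSizeEstimate * e.multiplier);
}

// The candidate is accepted when the callee estimate fits within the threshold.
bool isProfitableInline(const InlineEstimate& e)
{
    return e.calleeNativeSizeEstimate <= inlineThreshold(e);
}
```

With the logged values {816, 85, 10.8} the threshold is 918 and the check passes. With a smaller, purely illustrative multiplier such as 5.3, the threshold would drop to 450 and the same callee would be rejected, which is why recognizing the intrinsic calls matters for the decision.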

I'll see how this impacts JIT throughput in TC=0 mode. Perhaps, I should only enable it for methods with loops when TC_QJFL is 0 (default).

2. Enable CSE for `Vector128.Create(42)`

We do support CSE for SIMD operations, but it turns out Vector128.Create(42) had a very low "cost" (1), so CSE always gave up on it, e.g.:

static bool Caller()
{
    return Callee() && Callee();
}

static bool Callee()
{
    var a = Vector128.Create(42);
    var b = Vector128.Create(43);
    return Sse41.TestZ(a, b);
}

After inlining, CSE refused to optimize these Create calls (the log shows the Create(43) candidate):

Considering CSE #02 {$c2 , $4  } [def=100.000000, use=50.000000, cost=  2      ]
CSE Expression : 
N002 (  2,  2) CSE #02 (def)[000016] ------------              *  HWINTRINSIC simd16 int Create $c2
N001 (  1,  1)              [000015] ------------              \--*  CNS_INT   int    43 $42

Aggressive CSE Promotion (250.000000 >= 200.000000)
cseRefCnt=250.000000, aggressiveRefCnt=200.000000, moderateRefCnt=100.000000
defCnt=100.000000, useCnt=50.000000, cost=2, size=2
def_cost=1, use_cost=1, extra_no_cost=2, extra_yes_cost=0
CSE cost savings check (102.000000 >= 150.000000) fails
Did Not promote this CSE

After my change:

Considering CSE #01 {$c0 , $3  } [def=100.000000, use=50.000000, cost=  3      ]
CSE Expression : 
N002 (  3,  3) CSE #01 (def)[000011] ------------              *  HWINTRINSIC simd16 int Create $c0
N001 (  1,  1)              [000010] ------------              \--*  CNS_INT   int    42 $41

Aggressive CSE Promotion (250.000000 >= 200.000000)
cseRefCnt=250.000000, aggressiveRefCnt=200.000000, moderateRefCnt=100.000000
defCnt=100.000000, useCnt=50.000000, cost=3, size=3
def_cost=1, use_cost=1, extra_no_cost=4, extra_yes_cost=0
CSE cost savings check (154.000000 >= 150.000000) passes

Promoting CSE:
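
The two "CSE cost savings check" lines are consistent with a comparison between the estimated cost of leaving the expression alone and the estimated cost of promoting it to a CSE temp. The sketch below reproduces the logged numbers; the struct, field, and function names are hypothetical, and the formulas are inferred from the log values rather than copied from the JIT source.

```cpp
#include <cassert>

// Hypothetical mirror of the values printed in the CSE log.
struct CseCandidate
{
    double defCnt;       // weighted definition count (100 in the log)
    double useCnt;       // weighted use count (50 in the log)
    int    cost;         // cost of the expression (2 before the change, 3 after)
    int    defCost;      // def_cost
    int    useCost;      // use_cost
    int    extraNoCost;  // extra_no_cost
    int    extraYesCost; // extra_yes_cost
};

// Assumed cost of not doing the CSE: use count * expression cost, plus extras.
double costWithoutCse(const CseCandidate& c)
{
    return c.useCnt * c.cost + c.extraNoCost;
}

// Assumed cost of doing the CSE: store at each def, load at each use, plus extras.
double costWithCse(const CseCandidate& c)
{
    return c.defCnt * c.defCost + c.useCnt * c.useCost + c.extraYesCost;
}

// The check passes when skipping the CSE is at least as expensive as doing it.
bool passesCostSavingsCheck(const CseCandidate& c)
{
    assert(c.cost > 0);
    return costWithoutCse(c) >= costWithCse(c);
}
```

Before the change, {100, 50, 2, 1, 1, 2, 0} gives 102 against 150 and the check fails. After the change, {100, 50, 3, 1, 1, 4, 0} gives 154 against 150 and the check passes, so raising the cost of the Create node by one is enough to flip the decision.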

Codegen diff for Caller:

; Method Program:Caller():bool
G_M51476_IG01:
       vzeroupper 
G_M51476_IG02:
       vmovupd  xmm0, xmmword ptr [reloc @RWD00]
       vmovupd  xmm1, xmmword ptr [reloc @RWD16]
       vptest   xmm0, xmm1
       sete     al
       movzx    rax, al
       test     eax, eax
       je       SHORT G_M51476_IG05
-G_M51476_IG03:
-       vmovupd  xmm0, xmmword ptr [reloc @RWD00]
-       vmovupd  xmm0, xmmword ptr [reloc @RWD16]
-G_M51476_IG04:
       ret      
G_M51476_IG05:
       xor      eax, eax
G_M51476_IG06:
       ret      

RWD00  	dq	0000002A0000002Ah, 0000002A0000002Ah
RWD16  	dq	0000002B0000002Bh, 0000002B0000002Bh
-; Total bytes of code: 54
+; Total bytes of code: 38

@dotnet/jit-contrib @jakobbotsch (CSE-area owner) @tannergooding

Will post diffs a bit later

ghost · 2021-11-10T11:17:32Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

EgorBo · 2021-11-10T13:13:25Z

Inliner diffs (jit-diff tool, --pmi mode):

PMI CodeSize Diffs for System.Private.CoreLib.dll for  default jit

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 5536930
Total bytes of diff: 5589051
Total bytes of delta: 52121 (0.94 % of base)
Total relative delta: 394.76
    diff is a regression.
    relative diff is a regression.


Top file regressions (bytes):
       52121 : System.Private.CoreLib.dasm (0.94% of base)

1 total files with Code Size differences (0 improved, 1 regressed), 0 unchanged.

Top method regressions (bytes):
        1402 (289.07% of base) : System.Private.CoreLib.dasm - ValueTuple`1:System.IValueTupleInternal.ToStringEnd():String:this (8 methods)
        1208 (15.85% of base) : System.Private.CoreLib.dasm - WhenAllPromise`1:Invoke(Task):this (8 methods)
         700 (167.06% of base) : System.Private.CoreLib.dasm - HashSet`1:EqualityComparersAreEqual(HashSet`1,HashSet`1):bool (8 methods)
         664 (206.21% of base) : System.Private.CoreLib.dasm - BadImageFormatException:ToString():String:this
         619 (86.82% of base) : System.Private.CoreLib.dasm - TypeNameBuilder:AddAssemblyQualifiedName(Type,int):this
         613 (12.23% of base) : System.Private.CoreLib.dasm - Dictionary`2:GetObjectData(SerializationInfo,StreamingContext):this (8 methods)
         610 (12.11% of base) : System.Private.CoreLib.dasm - HashSet`1:GetObjectData(SerializationInfo,StreamingContext):this (8 methods)
         586 (13.63% of base) : System.Private.CoreLib.dasm - ArraySortHelper`1:IntroSort(Span`1,int,Comparison`1) (9 methods)
         519 (52.16% of base) : System.Private.CoreLib.dasm - TextInfo:ToTitleCase(String):String:this
         516 (56.27% of base) : System.Private.CoreLib.dasm - ConcurrentQueue`1:System.Collections.ICollection.CopyTo(Array,int):this (8 methods)
         494 (81.92% of base) : System.Private.CoreLib.dasm - ContractHelper:GetFailureMessage(int,String):String
         490 (103.81% of base) : System.Private.CoreLib.dasm - HashSet`1:.ctor(int,IEqualityComparer`1):this (8 methods)
         482 (113.95% of base) : System.Private.CoreLib.dasm - ManifestBuilder:WriteMessageAttrib(StringBuilder,String,String,String):this
         474 (14.31% of base) : System.Private.CoreLib.dasm - CustomAttributeTypedArgument:ToString(bool):String:this
         468 (77.61% of base) : System.Private.CoreLib.dasm - EventProvider:GetDataFromController(int,long,byref,byref,byref):bool:this
         467 (10.47% of base) : System.Private.CoreLib.dasm - DateTimeFormatInfo:CreateTokenHashTable():ref:this
         461 (105.25% of base) : System.Private.CoreLib.dasm - Lazy`1:PublicationOnlyViaConstructor(LazyHelper):this (8 methods)
         459 (24.88% of base) : System.Private.CoreLib.dasm - HashSet`1:.ctor(IEnumerable`1,IEqualityComparer`1):this (8 methods)
         449 (231.44% of base) : System.Private.CoreLib.dasm - EventSource:ReportOutOfBandMessage(String):this
         445 (64.77% of base) : System.Private.CoreLib.dasm - CultureInfo:GetCultureInfo(String,String):CultureInfo

Top method improvements (bytes):
        -807 (-11.16% of base) : System.Private.CoreLib.dasm - String:Concat(IEnumerable`1):String (9 methods)
        -693 (-9.63% of base) : System.Private.CoreLib.dasm - String:JoinCore(ReadOnlySpan`1,IEnumerable`1):String (8 methods)
        -552 (-11.36% of base) : System.Private.CoreLib.dasm - Vector128`1:ToString(String,IFormatProvider):String:this (6 methods)
        -529 (-10.75% of base) : System.Private.CoreLib.dasm - Vector`1:ToString(String,IFormatProvider):String:this (6 methods)
        -529 (-10.72% of base) : System.Private.CoreLib.dasm - Vector256`1:ToString(String,IFormatProvider):String:this (6 methods)
        -526 (-10.27% of base) : System.Private.CoreLib.dasm - Vector64`1:ToString(String,IFormatProvider):String:this (6 methods)
        -503 (-14.17% of base) : System.Private.CoreLib.dasm - StringSerializer:GetSerializedString(TimeZoneInfo):String
        -448 (-24.71% of base) : System.Private.CoreLib.dasm - Dictionary`2:.ctor(int,IEqualityComparer`1):this (12 base, 10 diff methods)
        -284 (-24.76% of base) : System.Private.CoreLib.dasm - RuntimeMethodInfo:ToString():String:this
        -207 (-13.13% of base) : System.Private.CoreLib.dasm - ApplicationId:ToString():String:this
        -204 (-30.09% of base) : System.Private.CoreLib.dasm - RTDynamicMethod:ToString():String:this
        -188 (-24.67% of base) : System.Private.CoreLib.dasm - RuntimePropertyInfo:ToString():String:this
        -128 (-12.12% of base) : System.Private.CoreLib.dasm - IdManager:.ctor():this (8 methods)
        -120 (-18.07% of base) : System.Private.CoreLib.dasm - SystemThreadingTasks_FutureDebugView`1:get_CancellationPending():bool:this (8 methods)
        -106 (-10.76% of base) : System.Private.CoreLib.dasm - ArraySortHelper`1:SwapIfGreater(Span`1,Comparison`1,int,int) (9 base, 8 diff methods)
        -106 (-5.87% of base) : System.Private.CoreLib.dasm - GateThread:GateThreadStart()
        -100 (-12.41% of base) : System.Private.CoreLib.dasm - String:JoinCore(ReadOnlySpan`1,ref):String
         -94 (-15.59% of base) : System.Private.CoreLib.dasm - RuntimeConstructorInfo:ToString():String:this
         -94 (-5.48% of base) : System.Private.CoreLib.dasm - String:Join(String,IEnumerable`1):String (9 methods)
         -93 (-7.32% of base) : System.Private.CoreLib.dasm - CustomAttributeData:ToString():String:this

Top method regressions (percentages):
         434 (1,808.33% of base) : System.Private.CoreLib.dasm - Assembly:CreateQualifiedName(String,String):String
         153 (1,390.91% of base) : System.Private.CoreLib.dasm - SignatureHelper:AddArgument(Type):this
          67 (1,340.00% of base) : System.Private.CoreLib.dasm - Array:System.Collections.IList.IndexOf(Object):int:this
         140 (1,272.73% of base) : System.Private.CoreLib.dasm - ModuleHandle:GetRuntimeMethodHandleFromMetadataToken(int):RuntimeMethodHandle:this
         140 (1,272.73% of base) : System.Private.CoreLib.dasm - ModuleHandle:ResolveMethodHandle(int):RuntimeMethodHandle:this
          58 (1,160.00% of base) : System.Private.CoreLib.dasm - Path:EndsInDirectorySeparator(ReadOnlySpan`1):bool
         425 (988.37% of base) : System.Private.CoreLib.dasm - Vector256:Abs(Vector256`1):Vector256`1 (6 methods)
         209 (950.00% of base) : System.Private.CoreLib.dasm - EventListener:EnableEvents(EventSource,int,long):this
          46 (920.00% of base) : System.Private.CoreLib.dasm - String:op_Equality(String,String):bool
         203 (882.61% of base) : System.Private.CoreLib.dasm - EventListener:EnableEvents(EventSource,int):this
         225 (681.82% of base) : System.Private.CoreLib.dasm - RuntimeModule:GetMethodImpl(String,int,Binder,int,ref,ref):MethodInfo:this
         220 (564.10% of base) : System.Private.CoreLib.dasm - ModuleBuilder:GetMethodImpl(String,int,Binder,int,ref,ref):MethodInfo:this
          61 (554.55% of base) : System.Private.CoreLib.dasm - FastResourceComparer:Equals(String,String):bool:this
          61 (554.55% of base) : System.Private.CoreLib.dasm - NonRandomizedStringEqualityComparer:Equals(String,String):bool:this
          61 (554.55% of base) : System.Private.CoreLib.dasm - OrdinalCaseSensitiveComparer:Equals(String,String):bool:this
         415 (506.10% of base) : System.Private.CoreLib.dasm - RuntimeFieldInfo:ToString():String:this
         430 (483.15% of base) : System.Private.CoreLib.dasm - ContractHelper:GetDisplayMessage(int,String,String):String
         417 (468.54% of base) : System.Private.CoreLib.dasm - ParameterInfo:ToString():String:this
         214 (465.22% of base) : System.Private.CoreLib.dasm - Dictionary`2:TryGetValue(__Canon,byref):bool:this (1 base, 5 diff methods)
          51 (463.64% of base) : System.Private.CoreLib.dasm - Net5CompatSeedImpl:NextBytes(Span`1):this

Top method improvements (percentages):
         -34 (-33.01% of base) : System.Private.CoreLib.dasm - NullabilityInfoContext:.ctor():this
        -204 (-30.09% of base) : System.Private.CoreLib.dasm - RTDynamicMethod:ToString():String:this
         -41 (-24.85% of base) : System.Private.CoreLib.dasm - Task:HandleException(Exception):this
        -284 (-24.76% of base) : System.Private.CoreLib.dasm - RuntimeMethodInfo:ToString():String:this
        -448 (-24.71% of base) : System.Private.CoreLib.dasm - Dictionary`2:.ctor(int,IEqualityComparer`1):this (12 base, 10 diff methods)
        -188 (-24.67% of base) : System.Private.CoreLib.dasm - RuntimePropertyInfo:ToString():String:this
         -47 (-22.71% of base) : System.Private.CoreLib.dasm - Task:ExecuteEntryUnsafe(Thread):this
         -47 (-22.71% of base) : System.Private.CoreLib.dasm - Task:ExecuteFromThreadPool(Thread):this
         -47 (-18.65% of base) : System.Private.CoreLib.dasm - Task:ExecuteEntry():bool:this
         -18 (-18.18% of base) : System.Private.CoreLib.dasm - AssemblyLoadContext:get_AllContexts():Dictionary`2
         -57 (-18.15% of base) : System.Private.CoreLib.dasm - ThreadPoolTaskScheduler:TryExecuteTaskInline(Task,bool):bool:this
        -120 (-18.07% of base) : System.Private.CoreLib.dasm - SystemThreadingTasks_FutureDebugView`1:get_CancellationPending():bool:this (8 methods)
         -15 (-18.07% of base) : System.Private.CoreLib.dasm - SystemThreadingTasks_TaskDebugView:get_CancellationPending():bool:this
         -47 (-16.67% of base) : System.Private.CoreLib.dasm - <>c:<.cctor>b__10_0(Object):this (2 methods)
         -94 (-15.59% of base) : System.Private.CoreLib.dasm - RuntimeConstructorInfo:ToString():String:this
         -72 (-15.52% of base) : System.Private.CoreLib.dasm - HashSet`1:.ctor(int):this (8 methods)
        -503 (-14.17% of base) : System.Private.CoreLib.dasm - StringSerializer:GetSerializedString(TimeZoneInfo):String
         -13 (-13.68% of base) : System.Private.CoreLib.dasm - CultureInfo:get_CachedCulturesByLcid():Dictionary`2
        -207 (-13.13% of base) : System.Private.CoreLib.dasm - ApplicationId:ToString():String:this
        -100 (-12.41% of base) : System.Private.CoreLib.dasm - String:JoinCore(ReadOnlySpan`1,ref):String

453 total methods with Code Size differences (93 improved, 360 regressed), 25750 unchanged.

About 450 methods are affected (453 in the summary above), but this PR doesn't change codegen for R2R or for the JIT with TC_QuickJitForLoops=1. So in the default mode it only improves inlining inside methods with loops in non-prejitted code. As a bonus, developers will see better (and closer to reality) inlining on sharplab.io.

CSE for Vector.Create:

Top method improvements (bytes):
         -31 (-1.10% of base) : 21617.dasm - HardwareIntrinsics.RayTracer.Packet256Tracer:GetNaturalColor(System.Runtime.Intrinsics.Vector256`1[Int32],HardwareIntrinsics.RayTracer.VectorPacket256,HardwareIntrinsics.RayTracer.VectorPacket256,HardwareIntrinsics.RayTracer.VectorPacket256,HardwareIntrinsics.RayTracer.Scene):HardwareIntrinsics.RayTracer.VectorPacket256:this

Top method regressions (percentages):
           5 ( 0.37% of base) : 6066.dasm - BilinearTest:BilinearInterpol_Vector(System.Double[],System.Double[],double,double,System.Double[],double,double,double):System.Double[]:this

We mostly hoist Vector.Create by hand in the BCL, so the diffs found almost nothing. The regression in BilinearInterpol_Vector doesn't look like a real regression: Vector.Create was CSE'd out of an inner loop (https://www.diffchecker.com/FvtaAZzj), but it was then not CSE'd out of the outer loop, see #61420 (comment).

jakobbotsch

LGTM.

EgorBo · 2021-11-10T15:43:13Z

Regarding the JIT's throughput (TP):
Per #51124 (comment), a similar change for Crossgen2 decreased TP by 1.5%, so similar numbers are expected for the JIT in TieredCompilation=0 mode. In the default mode, where we only regress in methods with loops, the total TP regression should be much lower, since not all methods have loops. Moreover, we plan to eventually (hopefully in .NET 7.0) make QuickJitForLoops=1 the default, so this change won't regress anything there.

@jkotas are we fine with it?

src/coreclr/jit/fgbasic.cpp

jkotas · 2021-11-10T15:54:29Z

> @jkotas are we fine with it?

Fine with me.

AndyAyersMS · 2021-11-10T16:48:15Z

> Perhaps, I should only enable it for methods with loops when TC_QJFL is 0

I don't think we should take any dependence on QJFL in the jit.

AndyAyersMS · 2021-11-10T16:51:25Z

src/coreclr/jit/gentree.cpp

@@ -3964,9 +3966,32 @@ unsigned Compiler::gtSetEvalOrder(GenTree* tree)
                            goto DONE;
                        }
                    }
+#endif
+
+                    switch (hwTree->gtHWIntrinsicId)


Do we need to review costing for HW intrinsics more broadly?

We likely should. We probably aren't accounting for cases where helper intrinsics are more expensive than they appear, or for cases where operands cost less because of the special handling that hardware intrinsics get.

There are also probably cases where operands (like scalar DBL_CNS) currently participate in overall CSE and shouldn't.

tannergooding · 2021-12-01T16:26:35Z

src/coreclr/jit/gentree.cpp

+                if (hwTree->gtGetOp1()->OperIsConst() && (hwTree->gtGetOp2() == nullptr))
+                {
+                    // Vector.Create(cns) is cheap but not that cheap to be (1,1)
+                    costEx = 2;


This should be IND_COST_EX.

Actually, it probably needs to be a "bit" more complex.

If all operands are constant and the value is not all bits zero or all bits set, then it's IND_COST_EX.

If part of the value isn't constant, then the cost increases as the number of operands increases. We don't currently handle "partial constants", but could eventually.

If the value represents all bits zero or all bits set, then it's cheaper: it's just an xor or the relevant cmp SIMD instruction, and it is special-cased by hardware.
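
The three rules above can be sketched as a small cost function. This is only an illustration of the suggestion, not the code that was merged: the enum, the function name, the value 3 for IND_COST_EX, and the specific costs chosen for the cheap and non-constant cases are all assumptions.

```cpp
#include <cassert>

// Assumed value; the JIT defines its own IND_COST_EX constant.
constexpr int kIndCostEx = 3;

// Hypothetical classification of the operands of a Vector128.Create(...) node.
enum class CreateKind
{
    AllBitsZero,    // every operand is the constant 0
    AllBitsSet,     // every operand is a constant with all bits set
    OtherConstant,  // all operands are constant, but neither case above
    NotAllConstant  // at least one operand is not a constant
};

// Illustrative execution cost following the three rules in the comment above.
int estimateCreateCost(CreateKind kind, int operandCount)
{
    assert(operandCount >= 1);
    switch (kind)
    {
        case CreateKind::AllBitsZero:
        case CreateKind::AllBitsSet:
            return 1; // single xor / compare instruction (assumed cost)
        case CreateKind::OtherConstant:
            return kIndCostEx; // treated like a load from memory
        case CreateKind::NotAllConstant:
            return kIndCostEx + operandCount; // grows with operand count (assumed shape)
    }
    return kIndCostEx; // not reached; keeps compilers quiet
}
```

Under these assumptions, Vector128.Create(42) would be classified as OtherConstant and cost 3, while an all-zero vector would stay at 1 and therefore remain unattractive to CSE.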

@tannergooding yes, I had the "all zeros/ones" cases in mind, but the problem is that they complicate the code a lot (especially the AllBitsSet case for different types) for a very rare case, since the get_Zero/get_AllBitsSet intrinsics are usually used instead. I think it won't hurt if we do CSE more often for these cases. Alternatively, we could move the logic that recognizes get_Zero/get_AllBitsSet earlier, into morph/importer, so that it works as expected and the IR is simplified earlier.

tannergooding · 2021-12-02T15:38:06Z

src/coreclr/jit/gentree.cpp

+            case NI_Vector128_Create:
+#endif
+            {
+                if ((hwTree->GetOperandCount() == 1) && hwTree->Op(1)->OperIsConst())


Why are we only doing OperandCount == 1?

What about the cases where OperandCount == 2 through OperandCount == 32? Are those being properly tracked as "expensive" and getting CSEd?

@tannergooding yeah, they are assigned a higher cost automatically due to the multiple arguments, so the problem doesn't reproduce for them. But that's a good point: I guess Vector128.Create(1,2,3,4,5,6,7,8) currently gets a very high cost while in reality it should still be 3/2.

I think it's fine to log an issue for this and cover it separately.

tannergooding

Changes LGTM.

We should log an issue to see if we can track some of the overall costs more accurately, particularly for helper intrinsics or intrinsics which are more expensive than others.

EgorBo · 2021-12-07T15:27:43Z

> Changes LGTM.
>
> We should log an issue to see if we can track some of the overall costs more accurately, particularly for helper intrinsics or intrinsics which are more expensive than others.

Sure, I'm taking a quick look now. E.g., on arm64 floating-point constants are currently never hoisted (unlike on x64, we can't "contain" them).

EgorBo · 2021-12-07T15:28:21Z

Windows Arm64 Checked seems to be failing everywhere and is unrelated to this PR.

Make Vector128.Create more expensive, enable call opcode resolution i…

a433102

…n no-TC mode

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 10, 2021

jakobbotsch approved these changes Nov 10, 2021

EgorBo mentioned this pull request Nov 10, 2021

Hoisting the invariant out of multi-level nested loops #61420

Closed

jkotas reviewed Nov 10, 2021

Remove dead code

1673244

AndyAyersMS approved these changes Nov 10, 2021

JulieLeeMSFT added this to the 7.0.0 milestone Nov 11, 2021

JulieLeeMSFT assigned EgorBo Nov 11, 2021

EgorBo added 4 commits November 15, 2021 12:53

Merge branch 'main' of https://github.com/dotnet/runtime into fix-lem…

72fb64e

…ire-issues

Merge branch 'main' of https://github.com/dotnet/runtime into fix-lem…

dddf349

…ire-issues

Merge branch 'main' of https://github.com/dotnet/runtime into fix-lem…

4f4c709

…ire-issues # Conflicts: # src/coreclr/jit/gentree.cpp

fix typo

80897b4

tannergooding reviewed Dec 1, 2021

fix build issue

3368ff2

tannergooding reviewed Dec 2, 2021

tannergooding approved these changes Dec 2, 2021

EgorBo added 2 commits December 7, 2021 11:59

Merge branch 'main' of https://github.com/dotnet/runtime into fix-lem…

bd38782

…ire-issues

Update gentree.cpp

8967fce

EgorBo merged commit 0ddc132 into dotnet:main Dec 7, 2021

EgorBo deleted the fix-lemire-issues branch December 7, 2021 15:28

This was referenced Jan 6, 2022

[Perf] Changes at 12/7/2021 3:27:58 PM #63458

Closed

[Perf] Changes at 12/7/2021 3:27:58 PM dotnet/perf-autofiling-issues#2750

Closed

kunalspathak mentioned this pull request Jan 6, 2022

[Perf] Changes at 12/7/2021 3:27:58 PM dotnet/perf-autofiling-issues#2744

Closed

ghost locked as resolved and limited conversation to collaborators Jan 6, 2022
